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Abstract. We present a Kalman smoothing framework based on modeling errors using the 
heavy tailed Student's t distribution, along with algorithms, convergence theory, open-source general 
implementation, and several important applications. The computational effort per iteration grows 
linearly with the length of the time series, and all smoothers allow nonlinear process and measurement 
models. 

Robust smoothers form an important subclass of smoothers within this framework. These 
smoothers work in situations where measurements are highly contaminated by noise or include data 
unexplained by the forward model. Highly robust smoothers are developed by modeling measurement 
errors using the Student's t distribution, and outperform the recently proposed £j-Laplace smoother 
in extreme situations with data containing 20% or more outliers. 

A second special application we consider in detail allows tracking sudden changes in the state. It 
is developed by modeling process noise using the Student's t distribution, and the resulting smoother 
can track sudden changes in the state. 

These features can be used separately or in tandem, and we present a general smoother algorithm 
and open source implementation, together with convergence analysis that covers a wide range of 
smoothers. A key ingredient of our approach is a technique to deal with the non-convexity of the 
Student's t loss function. Numerical results for linear and nonlinear models illustrate the performance 
of the new smoothers for robust and tracking applications, as well as for mixed problems that have 
both types of features. 



1. Introduction. The Kalman filter is an efficient recursive algorithm for esti- 
mating the state of a dynamic system [18] . Traditional formulations are based on £ 2 
penalties on model deviations, and are optimal under assumptions of linear dynamics 
and Gaussian noise. Kalman filters are used in a wide array of applications including 
navigation, medical technologies, and econometrics TTj[25p9 . Many of these problems 



are nonlinear, and may require smoothing over past data in both online and offline 



applications to significantly improve estimation performance 15 . 
This paper focuses on two important areas in Kalman smoothing: robustness with 
respect to outliers in measurement data, and improved tracking of quickly changing 
system dynamics. Robust filters and smoothers have been a topic of significant interest 
since the 1970's, e.g. see [24]. Recent efforts have focused on building smoothers that 
are robust to outliers in the data [2l[3j[T4], using convex loss functions such as l x , 



Huber or Vapnik, in place of the £ 2 penalty 17 



There have also been recent efforts to design smoothers able to better track fast 



system dynamics, e.g. jumps in the state values. A contribution can be found in 21 
where the Laplace distribution, rather than the Gaussian, is used to model transition 
noise. This introduces an £ 1 penalty on the state evolution in time, resulting in an 



estimator interpretable as a dynamic version of the well known LASSO procedure 26 
For known dynamics, all of the smoothers mentioned above can be derived by 
modeling the process and the measurement noise using log-concave densities, taking 
the form 

p(-) oc exp(— /?(•)), p convex . (1.1) 
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Formulations exploiting (1.1 1 are nearly ubiquitous, in part because they correspond 
to convex optimization problems in the linear case. However, in order to model a 
regime with large outliers or sudden jumps in the state, we want to look beyond ( 1.1 1 
and allow heavy-tailed densities, i.e. distributions whose tails are not exponentially 
bounded. All such distributions necessarily have non-convex loss functions [6j Theorem 
2.1]. 

Several interesting candidates are possible, but in this contribution we focus on the 
Student's t-distribution for its convenient properties in the context of the applications 
we consider. The Student's t-distribution was successfully applied to a variety of 



functions 16 



robust inference applications in 19 , and is closely related to re-descending influence 



In this work, we propose a smoothing framework for several applications, including 
robust and trend smoothing. The T-Robust smoother is derived from a dynamic 
system with output noise modeled by the Student's t-distribution. This is a further 
robustification of the estimator proposed in |2] , which uses the Laplace density. The 
re-descending influence function of the Student's t guarantees that outliers in the 
measurements have less of an effect on the smoothed estimate than any convex loss 
function. In practice, the T-Robust smoother performs better than [2] for cases with a 
high proportion of outliers. The T- Trend smoother is similarly derived starting from a 
dynamic system with transition noise modeled by the Student's t-distribution. This 
allows T-Trend to better track sudden changes in the state. One may consider using 
both aspects simultaneously; in addition, practitioners need the ability to distinguish 
between different measurements based on prior information of measurement fidelity, 
and between different states based on prior knowledge of trend stability. 

In the context of Kalman filtering/smoothing, the idea of using Student's t- 
distributions to model the system noise for robust and tracking applications was first 
proposed in [13] . However, our work differs from that approach in some important 
aspects. First, our analysis includes nonlinear measurement and process models. 
Second, we provide a novel approach to overcome the non-convexity of the Student's 
t-loss function. Third, the approach we propose can be used to solve any smoothing 
problem that uses Student's t modeling for any process or measurement components. 



The basic approach differs significantly from the one proposed in 13 . 13 



proposes 

using the random information matrix (i.e. full Hessian) when possible, or its expectation 
(Fisher information) when the Hessian is indefinite. Instead, we propose a modified 
Gauss- Newton method which builds information about the curvature of the Student's 
t-log likelihood into the Hessian approximation, and is guaranteed to be positive 
definite. As we show in Section [5] the new approach is provably convergent, and 
unlike the approach in 13 uses information about the relative sizes of the residuals in 
computing descent directions. These differences make it more stable than methods 
using random information, and more efficient than methods using Fisher information. 
The major computational tradeoff in using non-convex penalties is that the loss 
function in the convex case is used directly |2j, i.e. is not approximated, whereas in the 
nonconvex case, the loss function must be iteratively approximated with a local convex 
approximation. This requires a fundamental extension of the convergence analysis. 

A conference proceeding previewing this paper appears in [5]. In the current 
work, we present a general smoothing framework that includes the two smoothers 
presented in [H] as special cases, together with a generalized convergence theory that 
covers the entire range of smoothers under discussion. We also provide an open-source 
implementation of the general algorithm [l], with a simple interface that enables 
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Fig. 2.1. Gaussian, Laplace, and Student's t Densities, Corresponding Negative Log Likelihoods, 
and Influence Functions. 



the user to customize which residual or innovation components to model using the 
Student's t penalty. Using this implementation, we present an expanded experimental 
section, and new experiments that show how robust and trend smoothing can be done 
simultaneously. Finally, we apply the smoothers to real data. 

The paper is organized as follows. In Section [2] we introduce the multivariate 
Student's t-distribution, review its advantages for error modeling over log-concave 
distributions, and introduce the dynamic model class of interest for Kalman smoothing. 
In Section [3j we describe a statistical modeling framework, where we can use Student's 
t to model any process or measurement residual components. We describe all objectives 
that can arise this way, and provide a comprehensive method for obtaining approximate 
second order information for these objectives. In Section [4j we provide details for three 
important special smoothers: T- Robust (robust against large measurement noise), 
T- Trend (able to follow sharp changes in the state), and the Double-T smoother 
(incorporates both aspects). In Section [5j we present the algorithm and a convergence 
theory for the entire framework, which also extends the convergence theory developed 
in [2j. In Section [6j we present numerical experiments that illustrate the behavior of 
all three special smoothers, include illustrations of linear and nonlinear models, and 
results for real and simulated data. We end the paper with concluding remarks. 

2. Error Modeling with Student's t. For a vector u € K™ and any positive 
definite matrix M € R nxn , let := V u T Mu. We use the following generalization 

of the Student's t-distribution: 

pfofclAO = - 7 „,i /2 I 1 + r-^— I I 2 - 1 ) 



r(f )det[-KsR] 



where \x is the mean, s is the degrees of freedom, m is the dimension of the vector 
v k , and R is a positive definite matrix. A comparison of this distribution with the 
Gaussian and Laplacian distribution appears in Figure [5J Note that the Student's 
t-distribution has much heavier tails than the others, and that its influence function 



is re-descending, see 20 for a discussion of influence functions. This means that as 
we pull a measurement further and further away, its 'influence' decreases to 0, so it is 
eventually ignored by the model. Note also that the ^-Laplace is peaked at 0, while 
the Student's t-distribution is not, and so a Student's t-fit will not in general drive 
residuals to be exactly 0. 

Before we proceed with the Kalman smoothing application, we review a result 
from |6j, illustrating the fundamental modeling advantages of heavy tailed distributions: 

Theorem 2.1. Consider any scalar density p arising from a symmetric convex 
coercive and differ •entiable penalty p via p(x) — exp(—p(x)), and take any point t with 
p (t ) = a > 0. Then for all t 2 > t l > t , the conditional tail distribution induced by 
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p(x) satisfies 



Pr(M > t 2 | |y| > t x ) < exp(-a [i 2 - t x ]) . 



(2.2) 



When t x is large, the condition \y\ > t\ indicates that we are looking at an outlier. 
However, as shown by the theorem, any log-concave statistical model treats the outlier 
conservatively, dismissing the chance that \y\ could be significantly bigger than 
Contrast this behavior with that of the Student's t-distribution. When v = 1, the 
Student's t-distribution is simply the Cauchy distribution, with a density proportional 
to 1/(1 + y ). Then we have that 



lim Pr( 

t— >oo 



V 



> 2t 



f - arctan(2i) 1 

\y\ >t)= lim \ —r = -. 

m ' t^ao f - arctan(i 2 



Heavy tailed distributions thus provide a fundamental advantage in cases where outliers 
may be particularly large, or, in the second application we discuss, very sudden trend 
changes may be present. 

We now turn to the Kalman smoothing framework. We use the following general 
model for the underlying dynamics: for k = 1, . . . , N 



(2.3) 



with initial condition g\(x Q ) = g 



with g a known constant, and where g k 



K — > R are known smooth process functions, and h k :M. — > R are known smooth 
measurement functions. Moreover, w k and v k are mutually independent, and with 
known covariance matrices Q k £ R nx ™ and R k € R mxm , respectively. Note that here 
we assume all the measurement vectors have consistent dimension m. There is no 
loss of generality compared to the standard model where the dimensions depend on 
k, since any measurement vector can be augmented to a standard size to, and then 
the phantom measurements can be disabled using the modeling interface (by setting 
corresponding columns and rows of R k 1 to 0.) 

We now briefly explain how to use Student's t error modeling to design smoothers 
with two important characteristics. In order to obtain smoothers that are robust to 
heavily contaminated data, the vector v k € R rn ^ can be modeled zero-mean Student's 
t measurement noise (2.1 1 of known covariance R k € ^jn{k)xm{k) an( j degrees of 
freedom s. To design smoothers that can track sudden changes in the state, the 
process residuals w k are modeled using Student's t noise. These features may be 
employed separately or in tandem, and we always assume that the vectors {w k } U {v k } 
are all mutually independent. 

In the next section, we design a smoother that finds the MAP estimates of {x k } 
for a general formulation, where Student's t or least squares modeling can be used for 
any or all process and measurement residuals. We then specialize it to recover the 
applications discussed above. 



3. Generalized Smoothing Framework. 

{u k } and matrices {T k } we use the notation 



Given a sequence of column vectors 



vec({u fe }) = 



u N 



, diag({T fc }) = 









T 2 





Tn. 
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We also make the following definitions: 



R = diag({i? fc }) 
Q = diag({QJ) 
x = vec({x k }) 



w(x) = vec({x k - g k {xh-i)}) 
v(x) = vec({z k - h k (x k )}). 



In the most general case, we suppose that any of the components w k or v k components 
can be modeled either using Gaussian or Student's t distributions. 

For the sake of modeling clarity, assume that subcomponents of measurement and 
innovation residuals are consistently modeled across time points k; this gives the user 
the ability to select which subvectors of process and measurement residuals to model 
using Student's t, but not to assign different penalties to different time points. 

G S 

Denote by w k and w k the subvectors of the innovation residuals w k , and denote 
by v k and v k the subvectors of the measurement residuals v k that are to be modeled 
using the Gaussian and Student's t distributions, respectively. Assume that all of 
these subvectors are mutually independent, and denote the corresponding covariance 

G S G S 

submatrices by Q k , Q k , R k , and R k . Maximizing the likelihood for this model is 
equivalent to minimizing the associated negative log likelihood 

- lnp({f*T}, {"h }) {wk}, {^fe }), 



which can be explicitly written as follows: 

N 



E*kn 



k=l 



1 



1 S\\2 

Ml (fl f)- 



in 



(3.1) 

s s 

where s and r are degree of freedom parameters corresponding to v k and w k . 

A first-order accurate affine approximation to our model with respect to direction 
d = vec{d k } near a fixed state sequence x is given by 



w(x; d) 
v{x; d) 



vec({x k - g k (x k -i) - g k (x k -i)d k }), 
vec({z k - h k {x k ) - h k 1] (x k )d k }). 



Set Qjv + i = /„ and gM+i( x N) = (where I n is the n x n identity matrix) so that the 
formulas are also valid for k = N + 1. 



We minimize the nonlinear nonconvex objective in (3.1) by iteratively solving 
quadratic programming (QP) subproblems of the form: 



min ^d T Cd + a T d w.r.t d G M™ 



A? 



(3.2) 



where a is the gradient of objective (4.1 1 with respect to x and C has the form 



C = 



C\ + E x 




An 











Cn + Hn Ao 



A N C 



N 



H 



N 



(3.3) 



Note that this matrix is symmetric block tridiagonal. This structure is essential to the 
computational results for a wide variety of Kalman filtering and smoothing algorithms; 



it was noted early on in 12 31 . 
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G S 

In order to fully describe Cu and A k , first let W , W denote the indices associated 

G S 

to all subvectors w k and w k within w k . For example, if the Student's t density is 
used for all measurement residuals, and the Gaussian penalty is used for all process 
residuals, then W G = {1, . . . , n}, W S = 0. 

Now define with A k , C kl H k € M. nxn as follows: 



A k {W s ,W s ) 



A k (W G ,W G ) 
C k (W G ,W G ) 



C k (W s ,W s )- 



r+\\«%t {Q l r i 
-(Qtr\g k 



l (9 



k+ll 1 
(1) \S 



(<2 G ) _1 



I Sii2 



] (QkY 



r + \\w S k+ i || ( Qf+i) 

//, (1)nSnT, d Sx-1/, (1)nS 

^ } ^ ( " fc } + ((4 1) ) G ) T (^)" 1 (4 1) ) G 

(S + lFfe|| (i jS r i) 



(3.4) 



(3.5) 



The entries of A k and C k not explicitly defined in (3.4) and (3.5) are set to 0. 

The Hessian approximation terms H k in (3.6) are motivated in Section [5] and 
are crucial to both practical performance and theoretical convergence analysis. The 
solutions to the subproblem (3.2) have the form d = —C l a, and can be found in an 
efficient and numerically stable manner in 0(nN) steps, since C is tridiagonal and 
positive definite (see [8]). 

4. Special cases. We know show how the general framework of the previous 
section can be specialized to obtain three smoothers. The first two are T-Robust and 
T- Trend, which are presented in [5] . The third is a new smoother where all residuals 
and innovations are modeled using Student's t. 

The objective corresponding to T-Robust is obtained from (3.1 ) by taking w k = w k , 
w k = 0, v G = 0, v k = v k : 



1 N 

-Y 

fe=i 



s In 



1 + 



\Wk\ 



(4.1) 



The terms A k ,C k ,H k in (3.4) — (3.6) become 



A k = 
C h = 

H k = 



K 1 



(a { ' 

\i)k + l 

t /l,(l)\Tn-l, (1) 

s(h k ) K k h k 

(* + IML0 



O- 1 a W 
Vfe+i5fc+i 



(4.2) 



The objective corresponding to T- Trend is obtained from (3.1 ) by taking w k = 0, 



Wk = w k , v G 



v k , v k = 0: 



1 N 

-Y 

2 ^ 

k=l 



r In 



\w k \ 



,2 

'fit 



(4.3) 
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The terms A k ,C k ,H k in (3.4) — (3.6) become 



A h = - 



r + \\w k \ 
-l 



Ci. — 



r(3i+i) T Qfe+iffl+i 



r+IKIIg-i r + \\w k+1 \\ Q -i 



H k = (h^fR- k 1 h{ 1) . 



(4.4) 



Finally, we can apply Student's t to all process and measurement residuals by 



G~ S ' G S 

taking w k = 0, w k — w k , v k = 0, v k — v k to obtain 



■ r fe ln 



k=l 



\\w k \\ 



n 2 



The terms A k ,C k ,H k in (3.4) — (3.6) become 



C, 



H h = 



rQk 1 ^ 
r+\\w k f Q - 

rQk 1 ^ _ 
(s+\\vk\\U) 



r-(<7i+i) T Q fc +iff£+i 



(4.5) 



(4.6) 



5. Algorithm and Global Convergence. When models g k and h k are linear, 
we can compare the algorithmic scheme proposed in the previous sections with the 



method in 13 . The latter uses the random information matrix (random Hessian) in 



place of the matrix C defined above, and recommends using the expected (Fisher) 
information when the full Hessian is indefinite. When the densities for w k and v k are 
Gaussian, this is equivalent to using Newton's method when possible, and using Gauss- 
Newton when the Hessian is indefinite. In general, using the expected information is 
known as the method of Fisher's scoring. In the Student's t-case, the scalar Fisher 
information matrix is computed in 



19 to be 



1 



(5.1) 



where a 2 is the variance and s is the degrees of freedom. The authors of 



13 



proposed 
Imple- 
menting this approach would effectively replace the terms ||wfc||2 or 1 1 1 1 2 j present in 
the denominators of H k and A k (see 4.2 and 4.4 1, with terms that depend only on s k 



using (5.1 1 as the Hessian approximation when the full Hessian is indefinite 

1 2 11 1 1 2 



and r k , the degrees of freedom. So while the random information (Hessian) matrix 
can become indefinite, the Fisher information is insensitive to outliers, and fails to 
down-weigh their contributions to the Hessian approximation. 

To overcome these drawbacks, and find a middle ground between using the full 
Hessian and using a very rough approximation, we propose a Gauss-Newton method 
that is able to incorporate the relative size information of the residuals into the Hessian 
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approximation. In the rest of this section we provide the details for the application of 
this method and a proof of convergence. 

As in [2 , the convergence theory is based upon the versatile convex-composite 
techniques developed in [9]. We begin by choosing the convex-composite structure for 
objective ( |3.1[ ). We write it in the convex-composite form K = p o F, with smooth F 
and convex p: 




II S II 2 

KII ( b2)-i 



JY 



k=l 



kfll(Qf)- 



(5.2) 
(5.3) 

(5.4) 



Note that the range of / is R + , and p is coercive on its domain. The terms indexed 



with superscript S in (3.5 ) and (3.6 1 combine to form a positive definite approximation 



to the Hessian of /. To see this, consider the scalar function 



1 2 
k(x) := — ln(l + x jr) . 

The second derivative of this function in x is given by 



(r + x 2 ) — 2x 2 
(r + x 2 ) 2 



( i 2x2 

(r + x ) 



(5.5) 



and is only positive on (— y/r, \/r). There are two reasonable globally positive approxi- 
mations to take. The first, 



(r + x 2 ) 2 



simply ignores the subtracted term —x . In practice, we found this approximation 



to be too aggressive. Instead, we drop the 2x from the left of (5.5) to obtain the 
approximation 



1 



(r + x 2 ) 2 (r + x 2 ) 



(5.6) 



Similarly, the terms indexed by superscript S in (3.5) and (3.6) provide globally 



positive definite approximations to the Hessian of /, using the strategy in (5.6). This 



strategy offers a significant computational advantage — the Hessian approximation that 
is built up down-weights the contributions of outliers, helping the algorithm proceed 
faster to the solution. As we shall see, these terms are also essential for the general 
convergence theory. 

Our approach exploits the objective structure by iteratively linearizing F about 
the iterates x k and solving the direction finding subproblem 



p(F{x k 



F 



x k )d) 



\d T U{x k )d, 



(5.7) 



ROBUST AND TREND FOLLOWING STUDENT'S T KALMAN SMOOTHERS 



9 



where U(x ) is a symmetric positive semidefinite matrix that depends continuously on 



k 

X . 



For any smoother in the framework of secti on [j) problem (5.7) can be solved with 



a single block-tridiagonal solve of the system (3.2 1, yielding descent directions d for 
the objective K{x). 

We now develop a general convergence theory for convex-composite methods to 
establish the overall convergence to a stationary point of K(x). This theory is in the 
spirit of [2] and [9], and allows the inclusion of the quadratic term ^d T U(x k )d in ( 5.7 1 . 
This term was not necessary in [2] , but is crucial here. Note that the theory does not 
rely at all on the technique used to solve the direction finding subproblem, and so the 
theory in this paper applies to the algorithm in [2] by taking {7 = 0. 

Recall from [9] that the first-order necessary condition for optimality in the convex 
composite problem involving K(x) is 

0e dK(x) =dp(F(x))F {1 \x) 



where dK(x) is the generalized subdifferential of if at a; 23 and dp{F{x)) is the 
convex subdifferential of p at F(x) [22]. Elementary convex analysis gives us the 
equivalence 

£ dK(x) <^ K(x) = inf p ( F(x) + F (1) (x)d ) . 
For the general smoothing class of interest, it is desirable to modify this objective 



by including curvature information, yielding the problem (5.7). We define the difference 
function 

A(x; d) = p ( F(x) + F (1) (x)d ) + ^d T U(x)d - K{x) , (5.8) 

where U(x) is positive semidefinite and varies continuously with x. Note that A(x; d) 
is a convex function of d that is bounded below, hence the optimal value 

A*(x) = inf A(x; d) (5.9) 

d 

is well defined regardless of the existence of a solution. If A* (a;) = 0, then £ 
argmin A(a;; d). Hence, by Theorem 3.6], A*(x) = if and only if £ dK{x). 

Given 7/ £ (0, 1), we define a set of search directions at x by 

D{x,rf) = { d I A(x;d) < r]A*(x) } . (5.10) 

Note that if there is a d £ D(x, rj) such that A(x; d) > —rye, then A*(x) > —e. These 
ideas motivate the following algorithm. 

Algorithm 5.1. Gauss-Newton Algorithm. 

The inputs to this algorithm are 

• x° £ M. n : initial estimate of state sequence 

• e > 0: overall termination criterion 

• r/ £ (0, 1): search direction selection parameter 

• P £ (0, 1): step size selection parameter 

• 7 € (0, 1): line search step size factor 
The steps are as follows: 

1. Set the iteration counter v = 0. 
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2. (Gauss-Newton Step) Find d v in the set D[x ,7]) in 5.10 Set A„ = A(x l/ ;d") 
in 1 5.8\ and Terminate if A„ > —e. 

3. (Line Search) Set 

t v = max 7* 

s.t. i £ {0, 1, 2, ■ ■ ■ } and 

s.t. p (F(x v + f <f)) < p (F(x")) + PlK- 

4- (Iterate) Set x u+1 = x u + t u d" and return to Step^ 

We now present a general global convergence theorem that covers any smoother in 
section[3] This theorem also generalizes [2j Theorem 5.1] to include positive semidefinite 
curvature terms in the Gauss-Newton framework. 

Theorem 5.1. Define 



A := {u\p(u) < K{x°)} 



(5.11) 



and suppose that there exists a r > such that F is bounded and uniformly 
continuous on the set 



5 :=co(F- 1 (A)) + 



(5.12) 



If {x v } is a sequence generated by the Gauss-Newton Algorithm 5.1 with initial point 
x° and e — 0, then one of the following must occur: 

(i) The algorithm terminates finitely at a point x with € dK(x^). 
(ii) The sequence \\d \\ diverges to +oo. 

(Hi) lim^gj A„ = lim„ e/ A*(x u ) = for every subsequence I for which the set 

{ d | v G / } is bounded. 
Moreover, if x is any cluster point of a subsequence I C Z + such that the subse- 
quence {d v \u E /} is bounded, then E dK(x). 

Proof. We will assume that none of (i) , (ii) , (iii) occur and establish a contradiction. 
Then there is a subsequence / such that 

sup || d v || < oo and supA„ < £ < . 

Since K(x u ) is a decreasing sequence that is bounded below by 0, w e kn ow that the 
differences K{x v+l ) - K{x v ) ->■ 0. Therefore, by Step 3) of Algorithm^ (t v A v 0, 
which implies that t veI — > 0. Without loss of generality we may assume that t v < 1 
and ^lld^ll < jt for all v E I. Hence for all v El, 

llFOr' + VT^O-^OOII <^7 _1 / F'{x v + st vl - l d v ) \\d v \\ds 

Jo 

< tM , 

where M is a bound on F over S . Let K be a Lipschitz constant for p over the 
compact set A + rMB. Again by Step 3) of Algorithm 5.1 for all v € I, 

< p(n* v + tvj-'dn) - p{f{x v )) 

< t vl - x A v + K\\F(x v + t vl - x d v ) - F(x v ) - ^7~ V^KII 



t vl l A v + t vl l K 



(F {1 \x y + st vl - 1 d v )~F (1 \x y )) d v 



ds 



< 



t„l 1 (A U + Kw(t ul ^I^IDH^I 
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where w is the modulus of continuity of F on S . Rearranging, we obtain 

o< (1 - /3) A y + i^t^ -1 K II )K II • 

Taking the limit for v G I, we obtain the contradiction < (1 — /?)£■ Hence, 
liniygj Aj, = 0, which implies that lim^ eI A*(x l/ ) — 0, since A u < r]A*{x v ) < 0. 

Finally, suppose that x is a cluster point of a sequence / C Z + for which {d v } is 
bounded. Without loss of generality, there exists a d such that (x v : d l/ ) veI — > (x,d). 
For all d € t*", 

where £7 = U(x u ). Taking the limit over J gives 

= p(F(x)+F^(x)d)+i\\d\\l-p(F(x)) 

< ^(/,(ir(x)+fW(5)d)+i||d|^-p(f(x))) , 

where {J = U(x). Since c? was chosen arbitrarily, it must be the case that A* (a;) = 0, 
which implies that £ dK by 9, Theorem 3.6]. □ 

A stronger convergence result is possible under stronger assumptions on F and 

Corollary 5.2. Suppose that F - (A) = { cc | F(x) e A} is bounded, and there 
exists < A min such that 

VieF'fA), < A min ||d|| 2 < d T U(x)d VdeNull(F (1) (x)) . (5.13) 



If {x u } is a sequence generated by Algorithm 5.1 with initial point x° and e — 0, then 
{x v } and {d v } are bounded and either the algorithm terminates finitely at a point x v 
with € dK{x 1 '), or A„ — > as v — > oo, and every cluster point x of the sequence 
{x v } satisfies <E dK(x). 

Proof. First note that F~ 1 (A) is closed since F is continuous, and therefore 
F^ 1 {A) is compact, since by assumption it is bounded. Hence S (see (5.12)) is also 
compact. Therefore, F^ 1 ' is uniformly continuous and bounded on S which implies 
that the hypotheses of Theorem 5.1 are satisfied, and so one of (i)-(iii) must hold. 
If (i) holds we are done, so we will assume that the sequence {x v } is infinite. Since 
{x u } C F~ (A), this sequence is bounded. We now show that the sequence {d u } of 
search directions is also bounded. 

Suppose that (5.13) holds. For any direction d v ', note that d v satisfies 

p (F{x v ) + F^\x v )d v ) + \\\d v \\l» <p{F{x v )) < p(F(x )) . (5.14) 
Since p > 0, we have 

{F(/)}CA and {F(x") + F (1) (x v )d"} C A (5.15) 

and 

^{d v ) T U v d v \ < p(F(x )^ \/u. (5.16) 
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Suppose that the {cf} is unbounded. Then without loss of generality, there exists a 
subsequence /, a unit vector u, and a vector x G F X (A) such that lim„ eI d v /\\d v \\ — > u 
and lining/ x v — > x. Since A is bounded, (5.151 implies that F {x)u 
Nul(F (1) (x)), and therefore 



0, so u e 



On the other hand 
have the contradiction 



by (|5T6|, ^(^)V 



< 



P(f (a )) 

Hd" || 2 



and so in the limit we 



< A min < u U{x)u < . 

Hence d v are bounded. The result now follows from Theorem l5.ll 
□ 

We now show that all smoothers of section [3] satisfy the required assumptions of 
Theorem 5.1 and Corollary (5.2). 

Corollary 5.3 (Smoother Satisfaction). Suppose that the process and measure- 
ment functions g k and h k in (2.3) are twice continuously differentiable. Then for F 
given in (5.3), F^ 1 ' is bou nded and uniformly continuous on S Q in (5.12). Moreover, 
the hypotheses of Corollary 5.2 hold if for all x in F 1 (A) and for all k, there exists rj 
such that 



< X] < er n 



G s (x) 



1 (w-i) 

-(9k 



I 



Proof. We first show that both A and F _1 (A) are bounded. The first claim 
follows immediately by the coercivity of p in (5.2). To verify the second claim, we 
will show that for any sequence of x v with ||a; || — > 00, we can find a subsequence J 
such that lim^gj \\w"\\ = 00, which implies the existence of subsequence I such that 



either \im ueI \\w \\ = 00 or lim„ e/ f[x u ) — 00. In particular there does not exist an 
unbounded sequence {x u } with F(x v ) C A , and therefore F _1 (A) must be bounded. 



if I!*"! 



00, we can find an index fee [1, . . . , N] and subsequence J such that 
lim^gj \\XkW —00. Now, either lim^gj W/. — 00 and we are done, or lim^g j \\9k( x k-i)\\ = | 
00, so lining j llx^xH = 00. Iterating this argument, we arrive at the limiting case 
Wi = x\ — Xi, and so if all \\Wj\\ are bounded for j > 1, we can guarantee that 
lim^j || || = 00. 

Since F is twice continuously differentiable by the hypotheses on g and h, the 
boundedness of F _1 (A) establishes the boundedness and uniform continuity of F^ 
on S in (5.12) for any r > 0. 



It remains to show that condition (5.13) is satisfied. Let W , W denote the 
indices associated to all subvectors w k and w k within w k . If d e Null(-F^(£)), then 



necessarily d w G = 0. This is simply because F^ is nonsingular on , since it 
contains the sub matrix 



G G (x) := 
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which is the standard process matrix G projected to those coordinates where Gaussian 
modeling is applied. To finish the analysis, we present the full form of the matrix U 
restricted to W : 



U = 







A 2 U 2 A3 1 

'■• '•• 



A 



N 



u, 



N 



diag({# fc }, 



(5.17) 



where 



Au-- 



U k = 



H,- 



r (Qk) (9k ) 
r +11^11^-1 

K(gi 1 ii) s ) T (Qg + i)' 1 (.9i 1 ii) 5 

r+\\w s k+1 f {Q s +i) -i 

(s+\\vif {R s r i) 



r+\\w S k f {Q s rl 



(5.18) 



Note that we can write the first summand in (5.17) as (G S ) T Q 1 G S , where 
Q- 1 :=dmg({Q^}), Q, 1 = 



r(Q s k r x 



, II Sii2 

r + \\w k \\ (Q s y i 

Since F 1 (A) is bounded, the denominators of Q^ 1 are bounded, and so eigenvalues of 
Q k are bounded from below, and the singular values of G are bounded from above. 
We now have 

s s 

< T)mm < °mm(G ) < Omax(G ) < Imax 

for all x, where the upper found follows from Theorem El 2.2] together with compactness 
ofF _1 (A). 

Then, by Hi Theorem 2.1], we have 

«((G S ) T Q- 1 G 5 ) < A — 



(Q )Vm\ 



for all x 6 F (A). This completes the proof. □ 

Remark 5.2. One can also consider conditions o n the individual g k that can 
produce a lower bound 77 on G S , as required by Corollary 



5.3{ One such condition is 
(5.19) 



< V < {i + CT min(s[.+i) ~ O-maxCSfc^) ~ ^maxCfii+l)} 

If this condition is satisfied, then by Theorem 2.2], 7] < a min (G S ). However, this 
condition is sufficient, and may not be necessary. 

6. Numerical Experiments. 
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6.1. T-Robust Smoother: function reconstruction using splines. In this 
section we compare the new T-robust smoother with the £ 2 -Kalman smoother [8] and 
with the ^-Laplace robust smoother |2], both implemented in [l]. The ground truth 
for this simulated example is 

x(t) = [— cos(i) — sin(i)] T . 

The time between measurements is a constant At. We model the two components of 
the state as the first and second integrals of white noise, so that 

At At 2 /2 
At 2 /2 At 3 /3_ ' 

This stochastic model for function reconstruction underlies the Bayesian interpretation 
of cubic smoothing splines, see [28] for details. 

The measurement model for the conditional mean of measurement z k given state x k is 
defined by 

h k {x k ) = [0 l]x h = x^ k , R k = o 2 , 

where x 2 k denotes the second component of x k , a 1 — 0.25 for all experiments, and 
the degrees of freedom parameter k was set to 4 for the Student's t methods. 
The measurements {z k } were generated as a sample from 

z k = x 2 (t k ) + v k) t k = 0.04tt x k 

where k = 1,2,..., 100. The measurement noise v k was generated according to the 
following schemes. 

1. Nominal: v k ~ N(0, 0.25). 

2. Gaussian contamination 

« Jk ~(l-p)N(0,0.25)+pN(0,^), 
for p G {0.1, 0.2, 0.5} and <j) £ {1, 4, 10, 100}. 

3. Uniform contamination 

v k - (1 -p)N(0,0.25) +pU(-10,10), 

forpe {0.1,0.2,0.5}. 
Each experiment was performed 1000 times. Table |6T] presents the results for our 
simulated fitting showing the median Mean Squared Error (MSE) value and a quantile 
interval containing 95% of the results. The MSE is defined by 

1 N 

jr ^2[ x i(h) - ^i,fe] 2 + [x 2 (t k ) - £2,fc] 2 , (6-1) 
fc=l 

where {x k } is the corresponding estimating sequence. 

From Table [O] one can see that T- Robust and the ^-smoother perform as well as 
the (optimal) ^ 2 " sm0 °ther at nominal conditions, and that both continue to perform 
at that same level for a variety of outlier generating scenarios. T-Robust always 



9k(xk-i) = 



1 
At 1 



Xk-l 



Q k = 
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Table 6.1 

Function reconstruction via spline: median MSE over 1000 runs and intervals containing 95% 
of MSE results. 



Outlier 


P 


£ 2 MSE 


t x MSE 


Student's t MSE 


Nominal 




.04(.02, .1) 


.04(.01, .1) 


.04(.01, .09) 


N(0,1) 


.1 


.06(.02, .12) 


.04(.02, .10) 


.04(.02, .10) 


N(0,4) 


.1 


.09(.04, .29) 


.05(.02, .12) 


.04(.02, .11) 


N(0, 10) 


.1 


.17(.05, .55) 


.05(.02, .13) 


.04(.02, .11) 


N(0, 100) 


.1 


1.3(.30, 5.0) 


.05(.02, .14) 


.04(.02, .11) 


U(-10, 10) 


.1 


.47(.12, 1.5) 


.05(.02, .13) 


.04(.02, .10) 


N(0, 10) 


.2 


.32(.ll, .95) 


.06(.02, .19) 


.05(.02, .16) 


N(0, 100) 


.2 


2.9(.94, 8.5) 


.07(.02, .22) 


.05(.02, .14) 


U(-10,10) 


.2 


1.1(.36, 3.0) 


.07(.03, .26) 


.05(.02, .13) 


N(0, 10) 


.5 


.74(.29, 1.9) 


.13(.05, .49) 


.10(.04, .45) 


N(0, 100) 


.5 


7.7(2.9, 18) 


.21(.06, 1.6) 


.09(.03, .44) 


U(-10,10) 


.5 


2.6(1.0, 5.8) 


.20(.06, 1.4) 


.10(.03, .44) 




Fig. 6.1. Function reconstruction via spline: performance of £2 Kalman smoother (dash), 
£ 1 -Laplace Robust smoother (dash-dot), and T-Robust (stair-case solid) on contaminated normal 
model with 50% outliers distributed uniformly on [-10,10]. True state x(t) is drawn as solid line. 
Measurements appear as 'o ' symbols, and all measurements visible off of the true state are outliers 
in this case. Values outside [-5,5] are plotted on the axis limits. 



performs at least as well as the ^-smoother, and it gains an advantage when either 
the probability of contamination is high, or the contamination is uniform. This is 
likely due to the re-descending influence function of the Student's t-distribution — the 
smoother effectively throws out bad points rather than simply decreasing their impact 
to a certain threshold, as is the case for the i^-smoother. As an example, results 
coming from a single run for the case where 50% of measurements are contaminated 



with the uniform distribution on [—10,10] are displayed in Figure 6.1 Notice that 
T-Robust has an advantage over the ^-smoother. 
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Fig. 6.2. Van Der Pol oscillator: smoother fits for X- component (left) and Y-component (right), 
with 70% outliers N (0,100). Black solid line is truth, magenta dash-dot is the i\ smoother result, 
and blue dashed line is T-robust. Measurements on X-component are shown as dots, with outliers 
outside the range [—5, 5] plotted on top and bottom axes. 



6.2. T-Robust Smoother: Van Der Pol oscillator. In this section, we 
present results for the Van Der Pol oscillator (VDP), described in detail in [2]. 
The VDP oscillator is a coupled nonlinear ODE defined by 

ii(t) = x 2 (t) 

x 2 {t) = - Xl (tf]x 2 (t) - x x (t) 



The process model here is the Euler approximation for X(t k ) given X(t k _ 



9k( x k-l) — 



x l,k-l + x 2,k-l^t 
x 2,k-l + M 1 - x l,k] x 2,k ~ EijJAt 



For this simulation, the ground truth is obtained from a stochastic Euler approximation 
of the VDP. To be specific, with fi = 2, V = 164 and At = 16/V, the ground truth 
state vector x k at time t k = kAt is given by x = (0, — 0.5) T and for k = 1, . . . , N, 
x k = 9k( x k-i) + w ki where {w k } is a realization of independent Gaussian noise with 
variance 0.01. 

In [2j, the fj-Laplace smoother was shown to have superior performance to the 
£ 2 -smoother, both implemented in [I]. We compared the performance of the nonlinear 
T-robust and nonlinear ^-Laplace smoothers, and found that T-robust gains an 



advantage in the extreme cases of 70% outliers. Figure 6.2 illustrates results coming 
from a single representative run. For 40% or fewer outliers, it is hard to differentiate 
the performance of the two smoothers for this nonlinear example. 

6.3. T-Robust Smoother: underwater Tracking Application. This appli- 
cation is described in detail in [2], so we just give a brief overview here. In [2] we 
used the application to test the fj-Laplace smoother. Here we use it for a qualitative 
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comparison between the T-Robust smoother, the fx-Laplace smoother, and the £ 2 
smoother with outlier removal. 

In this experiment, a tracking target was hung on a steel cable approximately 200 
meters below a ship. The pilot was attempting to keep the ship in place (hold station) 
at specific coordinates, but the ship was pitching and rolling due to wave action. The 
measurements for the smoother were sound travel times between the tracking target 
and four bottom mounted transponders at known locations, and pressure readings 
from a gauge that was placed on the target. Tracking data was independently verified 
using a GPS antenna mounted on a ship, and the GPS system provided sub-meter 
accuracy in position. 

Pressure measurements in absolute bars were converted to depth in meters by the 
formula 

depth = 9.9184(pressure - 1). 

We use N to denote the total number of time points at which we have tracking data. 
For k = 1, . . . , N, the state vector at time t k is defined by x k = (e k ,n k ,d k , e k ,n k , d k ) T 
where (e k ,n k ,d k ) is the ( east, north, depth ) location of the object (in meters from 
the origin), and (e k ,n k ,d k ) is the time derivative of this location. 

The measurement vector at time t k is denoted by z k . The first four components 
of z k are the range measurements to the corresponding bottom mounted transponders 
and the last component is the depth corresponding to the pressure measurement. For 
j = 1, . . . , 4, the model for the mean of the corresponding range measurements was 

hj,k(x k ) = \\{e k ,n k ,d k ) - 6J 2 - Ar 3 . 

These measurements were assumed independent with standard deviation 3 meters. 
These depth measurements were assumed to have standard deviation of 0.05 meters. 
We use At k to denote t k+1 — t k . The model for the mean of x k+1 given x k was 

(e k +e k At k , n k +h k At k , d k + d k At k , e k , n k , d k ) T . 

The process noise corresponding to east, north, and depth components of the condi- 
tional distribution of x k+1 given x k was assumed to be Gaussian, with mean zero and 
standard deviation .OlA^. The process noise corresponding to the derivative vector 
of east, north, and depth components of the conditional mean x k+1 given x k was also 
assumed Gaussian with mean zero and standard deviation .2At k . 



£ 2 -smoother results without outlier removal are shown in Figure 6.3 There are 
three large peaks (two in the east component and one in the north component of the 
state) that are due to measurement outliers, and require either an outlier removal 
strategy or robust smoothing. 



Three fits are shown in Figure 6.4 ^-Laplace, T-Robust, and £ 2 -smoother with 
outlier removal. The darker curves appearing below the track are independent verifica- 
tions using the GPS tracking near the top of the cable. A depth of 198 meters was 
added to the depth location of the GPS antenna so that the depth comparison can 
use the same axis for both the GPS data and the tracking results. Note that the time 
scale for the depth plots different (much finer) than the north, east, down plots, and 
demonstrates the accuracy of the GPS tracking as validated by the pressure sensor. 
T-Robust, like the ^-Laplace smoother, was able to use the whole data sequence, 
despite large outliers in the data. The fits look very similar, and it is clear that 
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2.53 
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Fig. 6.3. Track: Independent GPS verification (thick line and +), l^-smoother estimate (thin 
line). Note the large outliers in the data. 
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Fig. 6.4. Track: Independent GPS verification (thick line and +) and Residuals for (a): 
£ 1 -Laplace smoother (thin line), (b): T-Robust smoother (thin line), (c): £ 2 -smoother with outlier 
removal. 
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T-Robust can also be used for smoothing in the presence of outliers. Note that the 
T-Robust track (b) is smoother than the ^-Laplace track (a) but has more detail than 
the £ 2 -smoother track with outlier removal (c). This is easiest to see by comparing 



the east coordinates in (a), (b), and (c) of Figure 6.4 between 7.2 and 7.25 hours. 

The residual plots in Figure |6.4| are quite revealing. Outliers are defined as 
measurements corresponding to residuals with absolute value greater than three 
standard deviations from the mean. All outliers are shown as 'o' characters, and those 
that fall outside the axis limits are plotted on the vertical axis limit lines. Note that 
the £ 9 -smoother with outlier removal detects outliers after the first fit that are not 



outliers after the second fit. The peaks in Figure 6.3 are large enough to influence the 
entire fit, and so some points which are actually 'good' measurements are removed by 
the 3-cr edit rule, resulting in 'over-smoothing' of the outlier removal track and more 
detail in both of the robust smoothers in Figure |6.4| 

The ^-Laplace smoother pushes more of the residuals to zero, particularly those 
corresponding to depth measurements, which are the most reliable and frequent. The 
T-Robust smoother is somewhere in between — the residuals for the depth track are 
smaller in comparison to the residuals of the ^-smoother, but are not set to zero as 
by the ^-Laplace smoother. As discussed previously, these features are artifacts of 
the behavior of the distributions at zero, and the choice of smoother should be guided 
by particular applications. 

6.4. T- Trend Smoother: reconstruction of a sudden change in state. 

We present a proof of concept result for the T- Trend smoother, using two Monte 
Carlo studies of 200 runs. In the first study, the state vector, as well as the process 



and measurement models, are the same as in Sec. 6.1 At any run, x 2 has to be 



reconstructed from 20 measurements corrupted by a white Gaussian noise of variance 



0.05 and collected on [0, 2tt] using a uniform sampling grid. The top panel of Figure 6.5 
reports the boxplot of the 200 root-MSE errors for the £ 2 -, t\-, and T-Trend smoothers, 



while the top right panel of Figure 6.5 displays the estimate obtained in a single run. 
It is apparent that the performance of the three estimators is very similar. 

The second experiment is identical to the first one except that we introduce a 
'jump' at the middle of the sinusoidal wave. The bottom panel of Figure [675] reveals the 
superior performance of the T-Trend smoother under these perturbed conditions. The 



result depicted in the bottom right panel of Figure 6.5 for a single run of the experiment 
is representative of the average performance of the estimators. The estimate achieved 
by the £ 2 -smoother (dashed-line) does not follow the jump well (the true state is the 
solid line). The ^-smoother (dashdot) does a better job than the £ 2 -smoother, and 
the T-trend smoother outperforms the ^-smoother, following the jump very closely 
while still providing a good solution along the rest of the path. 

6.5. Reconstruction of a sudden change in state in the presence of 
outliers. Until now, we have considered robust and trend applications separately, in 
order to compare with previous robust smoothing formulations and to highlight the 
main features of the trend-filtering problem. A natural extension is to consider these 
features in tandem — in other words, can we smooth a track which has both outliers 
and a sudden change in state? In fact, smoothers of this nature (but exploiting convex 
formulations) have already been proposed [14] . 

The challenge to building such a strong smoother is that without prior knowledge, 
it is difficult to tell the difference between a bad measurement (an outlier) and a good 
measurement that may be consistent with a sudden change in the state. In many 
cases, the user will be aware that some sensors are reliable, while others are subject to 
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Nominal conditions Nominal conditions 




Perturbed conditions Perturbed conditions 




Fig. 6.5. Reconstruction of a sudden change in state obtained by £ 2 > ii, and T-Trend smoothers. 
Left: Boxplot of reconstruction errors under nominal (top) and perturbed (bottom) conditions. 
Right: Reconstructions obtained using £ 2 (dashed), (dashdot) and T-Trend (thin line) smoother. 
The thick line is the true state. 



contamination. This kind of prior information can now easily be incorporated using the 
generality and flexibility of section |3j so that the user may specify trustworthy sensors 
(by modeling corresponding residuals indices with Gaussians) as well as stable state 
components (by modeling corresponding innovation residual indices with Gaussians) . 
Note that this is very different from specifying which of the individual measurements 
are reliable, or which individual transitions follow the process model. 

In this section, we consider a situation where we have a trustworthy sensor s 1 and 
an occasionally malfunctioning sensor s 2 . Sensor s 2 gives frequent measurements, but 
some proportion of the time is subject to heavy contamination, while sensor s x gives 
measurements rarely, but they are trustworthy (i.e. only subject to small Gaussian 
noise). Using the flexible interface implemented in |fj, we can model s 1 errors as 
Gaussian and s 2 errors as Student's t. 

We use setup in section |6.4| together with the Gaussian outlier contamination 
scheme described in section |6.1| Both measurements are direct, so the measurement 
matrix in this case is 



H k x k 



1 
1 



x k 



Since in the ckbs interface, the user specifies R k rather than R k , missing measurements 
are easily specified by setting the corresponding component of R^ 1 to 0. 

For the contaminated sensor s 2 , we consider p = .2 contamination level, and 
4> = 100, very large contaminating variance. We have s 2 measurements at every time 
step, but s 1 measurements only at every 10th time step. 

The results are shown in figure |6.6[ Measurements are plotted using diamonds, 
with s 2 measurements represented by small symbols, while s 1 measurements are 
represented by large symbols. Ground truth is shown using a sold black line, and 
smoother results are shown using a red dashed line. Results in panel (a) were obtained 
using the least squares smoother, which cannot handle outliers. Results in panel 
(b) were obtained using T-Robust only, applying Student's t modeling only to the 
measurement components. The resulting fit is much better, but the smoother struggles 
to follow the jump in the track, overestimating the curve before the jump and under- 
estimating it after the jump. Results in panel (c) were obtained by the Double T 
smoother, which modeled all residuals and innovations using Student's t. Double T 
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Fig. 6.6. Tracking a sudden change in the presence of outliers. Measurements are plotted using 
diamonds, with s 2 measurements (frequent, contaminated) represented by small symbols, while Sj 
measurements (rare, reliable) represented by large symbols. Outliers appear on the axes when they 
are out of range of the plot limits. Ground truth is shown using a sold black line, and smoother 
results are shown using a red dashed line, (a): Least squares smoother (Gaussian errors for process 
and measurements) is very vulnerable to outliers, (b): T-Robust only (Gaussian errors for process 
components and s\, Student's t errors for s 2 ) effectively ignores the outliers, but struggles to follow 
sharp change in state, (c): Double T smoother (Student's t errors for all components): ignores 
outliers, but struggles to follow sudden change in state, (d): Trend following robust smoother 
(Student's t errors for process components and s 2 ; Gaussian for s-^): ignores outliers and follows 
sudden change in state, using information in the reliable measurements. 



follows the curve well before the jump, but not after. Note that there are a couple 
of measurements sitting along the sharp jump — the Double T suspects these to be 
outliers, only trusting the concentrated measurements to the right of the jump. 

Finally, results in panel (d) were obtained by using the information about which 
measurements are reliable. Specifically, Student's t modeling was used for all inno- 
vations residuals and for s 2 ? an d Gaussian modeling was used for the s 1 component. 
This smoother ignores the outliers and is able to follow the jump very well, since it 
takes advantage of the fact that there is a reliable measurement that happens to be 
sitting right in the middle of the transition. 

The file used to generate the subplots in the figure is noisy_jump_two_meas .m, 
which can be accessed through the example subdirectory of \U. 

7. Discussion and Conclusions. We have presented a generalized Student's t 
smoothing framework, which allows modeling any innovations or measurement residuals 
using Student's t errors, and includes T-Robust and T-Trend, and Double T smoothers 
as important special cases. All of the smoothers in the framework efficiently solve 
for the MAP estimates of the states in a state-space model with any selected set of 
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residuals modeled using Student's t or Gaussian noise. We have shown that these 
features can be used independently and in tandem, work for linear and nonlinear 
process models, and can be used both for outlier-robust smoothing and for tracking 
sudden changes in the state. 

Similar to contributions in other applications, e.g. sparse system identification 



10 27 30 , our results underscore the significant advantages of using heavy tailed 
distributions in statistical modeling. Heavy tailed distributions force the use of non- 
convex loss functions to solve for the associated MAP estimates [6j Theorem 2] . The 
consequent challenge is to optimize a non-convex objective even when the system 
dynamics are linear. In contrast to the convex case, this requires an iterative smoother. 
The convergence analysis for these methods is still developed within the general 
framework of convex-composite optimization [9], although the details of the analysis 
differ. 

Because the problems are non-convex, iterative methods may converge to local 
rather than global minima. This problem can be mitigated by an appropriate initial- 
ization procedure — for example, in the presence of outliers, the ^-Laplace smoother 
can be used to obtain a starting point for the optimizer, in which case we can improve 
on the l-y solution when the data is highly contaminated with outliers. This approach 
was not taken in our numerical experiments, which used the same initial points. For 
all the linear experiments, the initial point was simply the null state sequence. For 
the Van Der Pol, the initial state x was correctly specified in all experiments, and 
the remaining state estimates in the initial sequence were null. 

The T-Robust smoother compares favourably to the ^-Laplace smoother described 
in |2] , and outperforms it in our experiments when the data is heavily contaminated 
by outliers. The T-Trend smoother was designed for tracking signals that may exhibit 
sudden changes, and therefore has many potential applications in areas such as 
navigation and financial trend tracking. It was demonstrated to follow a fast jump in 
the state better than a smoother with a convex penalty on model deviation. Finally, 
we demonstrated the power of a new method by tracking a fast change in the presence 
of outliers using the full flexibility of the presented framework, which allowed us to 
differentially model residuals for sensors which we knew to be reliable vs. unreliable, 
and to design a smoother that was robust to outliers yet able to track sudden changes. 

An important question in the design and implementation of Student's t-based 
smoothers is how to estimate the degree of freedom parameter v. In all of our 
experiments, we have treated this parameter as fixed and know. We note that there are 
established EM-based methods in the literature for estimating these parameters 13p9 



as well as recently proposed methods Pf] , and we leave the implementation of these 
extensions in the Kalman smoothing framework to future work. 
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